
Manual On Setting Up, Using, And Understanding Random Forests V3.1

The V3.1 version of random forests contains some modifications and major additions to Version 3.0. It fixes a bad bug in V3.0. It allows the user to save the trees in the forest and run other data sets through this forest. It also allows the user to save parameters and comments about the run.

I apologize in advance for all bugs and would like to hear about them. To find out how this program works, read my paper "Random Forests". It is available on the same web page as this manual. It was recently published in the Machine Learning Journal.

The program is written in extended Fortran 77, making use of a number of VAX extensions. It runs on SUN workstation f77, on Absoft Fortran 77 (available for Windows), and on the free g77 compiler, but may have hang-ups on other f77 compilers. If you find such problems and fixes for them, please let me know.

Random forests computes:
classification and class probabilities
intrinsic test set error computation
principal coordinates to use as variables
variable importance (in a number of ways)
proximity measures between cases
a measure of outlyingness
scaling displays for the data

The last three can be done for the unsupervised case, i.e. no class labels. I have used proximities to cluster data and they seem to do a reasonable job. The new addition uses the proximities to do metric scaling of the data. The resulting pictures of the data are interesting and useful.

The first part of this manual contains instructions on how to set up a run of random forests V3.1. The second part contains the notes on the features of random forests V3.1 and how they work.


I. Setting Parameters

The first seven lines following the parameter statement need to be filled in by the user.

Line 1 Describing The Data

mdim = number of variables
nsample0 = number of cases (examples or instances) in the data
nclass = number of classes
maxcat = the largest number of values assumed by a categorical variable in the data
ntest = the number of cases in the test set. NOTE: put ntest=1 if there is no test set. Putting ntest=0 may cause compiler complaints.
labelts = 0 if the test set has no class labels, 1 if the test set has class labels.
iaddcl = 0 if the data has class labels. If not, iaddcl=1 or 2 adds a synthetic class as described below.

If there are no categorical variables in the data set, maxcat=1. If there are categorical variables, the number of categories assumed by each categorical variable has to be specified in an integer vector called cat, i.e. setting cat(5)=7 implies that the 5th variable is a categorical with 7 values. If maxcat=1, the values of cat are automatically set equal to one. If not, the user must fill in the values of cat in the early lines of code.

For a J-class problem, random forests expects the classes to be numbered 1, 2, ..., J. For an L-valued categorical, it expects the values to be numbered 1, 2, ..., L. At present, L must be less than or equal to 32.

A test set can have two purposes. First: to check the accuracy of RF on a test set. The error rate given by the internal estimate will be very close to the test set error unless the test set is drawn from a different distribution. Second: to get predicted classes for a set of data with unknown class labels. In both cases the test set must have the same format as the training set. If there is no class label for the test set, assign each case in the test set class label #1, i.e. put cl(n)=1, and set labelts=0. Else set labelts=1.
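To make the format requirements concrete, here is a small illustrative Python/numpy sketch (not part of the Fortran program; the array names are hypothetical) that recodes arbitrary class labels to the integers 1, ..., J and gives an unlabeled test set the placeholder label 1:

import numpy as np

# Hypothetical training labels and an unlabeled test set (illustration only).
y_train = np.array(["setosa", "virginica", "setosa", "versicolor"])
x_test = np.zeros((10, 4))                 # 10 unlabeled test cases, 4 variables

# Recode training labels to the integers 1, 2, ..., J that RF expects.
classes, y_coded = np.unique(y_train, return_inverse=True)
y_coded = y_coded + 1                      # np.unique codes from 0; RF wants 1..J

# An unlabeled test set still needs a class column: give every case class 1
# and run with labelts=0, as described above.
y_test_placeholder = np.ones(len(x_test), dtype=int)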

If the data has no class labels, addition of a synthetic class enables it to be treated as a two-class problem with nclass=2. Setting iaddcl=1 forms the synthetic class by independent sampling from each of the univariate distributions of the variables in the original data. Setting iaddcl=2 forms the synthetic class by independent sampling from uniforms such that each uniform has range equal to the range of the corresponding variable.
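The two constructions can be pictured with a short Python/numpy sketch (an illustration of the description above, not the Fortran code itself; the data array is hypothetical):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))        # hypothetical unlabeled data: 200 cases, 5 variables
n, m = x.shape

# iaddcl=1: each synthetic variable is drawn independently from the empirical
# univariate distribution of the corresponding original variable.
synth1 = np.column_stack([rng.choice(x[:, j], size=n) for j in range(m)])

# iaddcl=2: each synthetic variable is drawn from a uniform distribution whose
# range equals the range of the corresponding original variable.
lo, hi = x.min(axis=0), x.max(axis=0)
synth2 = lo + (hi - lo) * rng.random((n, m))

# The two-class problem: original cases are class 1, synthetic cases are class 2.
x2 = np.vstack([x, synth1])
y2 = np.r_[np.ones(n, dtype=int), 2 * np.ones(n, dtype=int)]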

Line 2 Setting up the run

jbt = number of trees to grow
mtry = number of variables randomly selected at each node
look = how often you want to check the prediction error
ipi = set priors
ndsize = minimum node size

jbt: this is the number of trees to be grown in the run. Don't be stingy--random forests produces trees very rapidly, and it does not hurt to put in a large number of trees. If you want auxiliary information like variable importance or proximities, grow a lot of trees--say 1000 or more. Sometimes I run out to 5000 trees if there are many variables and I want the variable importances to be stable.

mtry: this is the only parameter that requires some judgment to set, but forests isn't too sensitive to its value as long as it's in the right ball park. I have found that setting mtry equal to the square root of mdim gives generally near optimum results. My advice is to begin with this value and try a value twice as high and half as low, monitoring the results by setting look=1 and checking the internal test set error for a small number of trees. With many noise variables present, mtry has to be set higher.
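For concreteness, the rule of thumb can be written out directly (a Python sketch; the numbers are only an example):

import math

mdim = 100                               # hypothetical number of variables
mtry_default = round(math.sqrt(mdim))    # suggested starting point, here 10
candidates = [max(1, mtry_default // 2), mtry_default, mtry_default * 2]
# Try each candidate with look=1 on a small number of trees and keep the value
# whose internal test set error settles lowest.
print(candidates)                        # [5, 10, 20]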

look: random forests carries along an internal estimate of the test set error as the trees are being grown. This estimate is outputted to the screen every look trees. Setting look=10, for example, gives the internal error output every tenth tree added. If there is a labeled test set, it also gives the test set error. Setting look=jbt+1 eliminates the output. Do not be dismayed to see the error rates fluttering around slightly as more trees are added. Their behavior is analogous to the sequence of averages of the number of heads in tossing a coin.

ipi: pi is a real-valued vector of length nclass which sets prior probabilities for classes. ipi=0 sets these priors equal to the class proportions. If the class proportions are very unbalanced, you may want to put larger priors on the smaller classes. If different weightings are desired, set ipi=1 and specify the values of the {pi(j)} early in the code. These values are later normalized, so setting pi(1)=1, pi(2)=2 weights a class 2 instance twice as much as a class 1 instance. The error rates reported are an unweighted count of misclassified instances.
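The normalization is simple arithmetic; a short Python sketch of the example above (the weighting itself is done inside the Fortran code):

pi = [1.0, 2.0]                          # user-specified pi(j) values with ipi=1
weights = [p / sum(pi) for p in pi]      # normalized to [1/3, 2/3]
# A class 2 instance then counts twice as much as a class 1 instance when the
# forest is grown, but the reported error rates remain unweighted counts.
print(weights)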

ndsize: setting this to the value k means that no node with fewer than k cases will be split. The default that always gives good performance is ndsize=1. In large data sets, memory requirements will be less and speed enhanced if ndsize is set larger. Usually, this results in only a small loss of accuracy for large data sets.

Line 3 Options on Variable Importance

imp=1 turns on the variable importance methods described below.

impstd=1 gives the standard imp output
impmargin=1 gives, for each case, a measure of the effect of noising up each variable
impgraph=1 gives, for each variable, a plot of the effect of the variable on the class probabilities.

impstd=1 computes and prints the following columns to a file:
i) variable number
followed by the variable importances, computed as:
ii) the % rise in error over the baseline error
iii) 100 * the change in the margins averaged over all cases
iv) the proportion of cases for which the margin is decreased minus the proportion of increases
v) the gini increase by variable for the run
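The "noising up" idea behind measures ii)-iv) can be sketched roughly as follows (a simplified Python illustration, not the program's exact computation; the predict function stands in for the grown forest and is an assumption):

import numpy as np

def rise_in_error(predict, x_test, y_test, rng=np.random.default_rng(0)):
    # Baseline error of the forest on the test (or left-out) cases.
    baseline = np.mean(predict(x_test) != y_test)
    rises = []
    for m in range(x_test.shape[1]):
        x_noised = x_test.copy()
        x_noised[:, m] = rng.permutation(x_noised[:, m])   # noise up variable m
        err = np.mean(predict(x_noised) != y_test)
        # Measure ii): percent rise in error over the baseline error.
        rises.append(100.0 * (err - baseline) / max(baseline, 1e-12))
    return rises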

impgraph=1 computes and prints out the following columns for each variable m:
i) variable number, i.e. m
ii) sorted values of x(m) from lowest to highest
iii to iii+nclass) effect of x(m) on the probabilities of each class j


Line 4 Options based on proximities

iprox=1 turns on the computation of the intrinsic proximity measures between any two cases. This has to be turned on for the following options to work.

noutlier=1 computes an outlyingness measure for all cases in the data. If iaddcl=1, then the outlyingness measure is computed only for the original data. The output has the columns:

i) class
ii) case number
iii) measure of outlyingness
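As a rough picture of what such a measure does (the program's exact formula is given in the notes, not reproduced here; this Python sketch simply flags cases whose proximities to the rest of their class are uniformly small):

import numpy as np

def outlyingness(prox, cl):
    # prox: n x n proximity matrix, cl: class labels 1..J (illustrative sketch).
    cl = np.asarray(cl)
    n = len(cl)
    raw = np.empty(n)
    for i in range(n):
        same = (cl == cl[i]) & (np.arange(n) != i)
        # Small proximities to same-class cases --> large raw outlyingness.
        raw[i] = n / max(np.sum(prox[i, same] ** 2), 1e-12)
    out = np.empty(n)
    for j in np.unique(cl):
        idx = cl == j
        med = np.median(raw[idx])
        mad = np.median(np.abs(raw[idx] - med)) + 1e-12
        out[idx] = (raw[idx] - med) / mad      # standardize within each class
    return out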

iscale=1 computes scaling coordinates based on the proximity matrix. If iaddcl is turned on, then the scaling is outputted only for the original data. The output has the columns:

i) case number
ii) true class
iii) predicted class
iv) 0 if ii)=iii), 1 otherwise
v to v+mdimsc) scaling coordinates

mdimsc is the number of scaling coordinates to be extracted. Usually 4-5 is sufficient.
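The scaling coordinates come from metric scaling of the proximities. A compact Python/numpy sketch of classical metric scaling, assuming the proximities are turned into dissimilarities as 1 - prox (an assumption for illustration, not the program's exact recipe):

import numpy as np

def scaling_coordinates(prox, mdimsc=4):
    # Squared dissimilarities derived from the proximities (assumed 1 - prox).
    d2 = (1.0 - prox) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ d2 @ j                        # double-centered matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:mdimsc]      # keep the largest eigenvalues
    # Scaling coordinate k = eigenvector k scaled by sqrt of its eigenvalue.
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

Plotting the first coordinate against the second usually gives the most informative picture, as noted in the remarks below.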

Line 5 Transform to Principal Coordinates

ipc=1 takes the x-values and computes principal coordinates from the covariance matrix of the x's. These will be the new variables for RF to operate on. This will not work right if some of the variables are categorical.

mdimpc: this is the number of principal components to extract. It has to be <= mdim.

If maxcat > 1, then the values of the categorical variables (the cat vector) need to be filled in early in the code. If ipi=1, the user needs to specify the relative weights of the classes.
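What ipc=1 computes can be mimicked in a few lines of numpy (an illustration only, not the Fortran code):

import numpy as np

def principal_coordinates(x, mdimpc):
    # Project the centered data onto the top mdimpc eigenvectors of its
    # covariance matrix; the projections become the new variables for RF.
    xc = x - x.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(xc, rowvar=False))
    order = np.argsort(vals)[::-1][:mdimpc]
    return xc @ vecs[:, order]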

File names need to be specified for all output. This is important since a chilling message after a long run is "file not specified" or something similar.

REMARKS:

The proximities can be used in the clustering program of your choice. Their advantage is that they are intrinsic rather than an ad hoc measure. I have used them in some standard and home-brew clustering programs and gotten reasonable results. The proximities between class 1 cases in the unsupervised situation can be used to cluster. Extracting the scaling coordinates from the proximities and plotting scaling coordinate i versus scaling coordinate j gives illuminating pictures of the data. Usually, i=1 and j=2 give the most information (see the notes below).

There are four measures of variable importance; they complement each other. Except for the 4th, they are based on the test sets left out on each tree construction. On microarray data with 5000 variables and fewer than 100 cases, the different measures single out much the same variables (see notes below). But I have found one synthetic data set where the 3rd measure was more sensitive than the others.

Sometimes, finding the effective variables requires some hunting. If the effective variables are clear-cut, then the first measure will find them. But if the number of variables is large compared to the number of cases, and if the predictive power of the individual variables is small, the other measures can be useful.

Random forests does not overfit. You can run as many trees as you want. Also, it is fast. Running on a 250 MHz machine, the current version using a training set with 800 cases, 8 variables, and mtry=1 constructs each tree in .1 seconds. On a training set with 2200 cases, 11 variables, and mtry=3, each tree is constructed in .2 seconds. It takes 4 seconds per tree on a training set with 15000 cases and 16 variables with mtry=4, while also making computations for a 5000 member test set.


The present version of random forests does not handle missing values. A future version will. It is up to the user to decide how to deal with these. My current preferred method is to replace each missing value by the median of its column and each missing categorical by the most frequent value in that categorical. My impression is that because of the randomness and the many trees grown, filling in missing values with sensible values does not affect accuracy much.
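The preferred fill-in is easy to state concretely; a Python sketch (illustrative only; which columns are categorical is an assumption supplied by the user):

import numpy as np

def fill_missing(x, is_categorical):
    # x: float array with NaN for missing values; categoricals coded as integers.
    x = x.copy()
    for j in range(x.shape[1]):
        col = x[:, j]
        miss = np.isnan(col)
        if not miss.any():
            continue
        if is_categorical[j]:
            vals, counts = np.unique(col[~miss], return_counts=True)
            fill = vals[np.argmax(counts)]      # most frequent category
        else:
            fill = np.median(col[~miss])        # column median
        col[miss] = fill
    return x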

For large data sets, if proximities are not required, the major memory requirement is the storage of the data itself and the three integer arrays a, at, b. If there are fewer than 64,000 cases, these latter three may be declared integer*2 (non-negative). Then the total storage requirement is about three times the size of the data set. If proximities are calculated, storage requirements go up by the square of the number of cases times eight bytes (double precision).
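A rough worked example of that arithmetic (assuming one entry per data value in each of a, at, b, which is an assumption for illustration):

# Rough storage estimate following the paragraph above (illustration only).
nsample, mdim = 20000, 16
data_bytes = nsample * mdim * 4            # the data itself, 4-byte reals
abc_bytes = 3 * nsample * mdim * 2         # a, at, b declared integer*2
prox_bytes = nsample * nsample * 8         # only if iprox=1, double precision
print(data_bytes / 1e6, abc_bytes / 1e6, prox_bytes / 1e9)   # ~1.3 MB, ~1.9 MB, 3.2 GB

The point of the example is that, for large data sets, the proximity matrix dwarfs everything else.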

Outline Of How Random Forests Works

Usual Tree Construction--CART

Node=subset of data. The root node contains all data.

At each node, search through all variables to find the best split into two children nodes.

Split all the way down and then prune the tree up to get minimal test set error.

Random Forests Construction

The root node contains a bootstrap sample of the data, of the same size as the original data. A different bootstrap sample is drawn for each tree to be grown.

An integer K is fixed, with K much smaller than the total number of variables (K is the mtry parameter above). At each node, K variables are selected at random and the node is split on the best split among them. The trees are grown to maximum size and are not pruned.
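Putting the outline together, here is a bare-bones Python sketch of the construction (using scikit-learn decision trees as stand-ins for the Fortran tree grower; max_features plays the role of K. This is an illustration, not Breiman's program):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(x, y, jbt=500, k=None, seed=0):
    # Grow jbt unpruned trees, each on its own bootstrap sample, choosing the
    # best split among k randomly selected variables at every node.
    rng = np.random.default_rng(seed)
    n, mdim = x.shape
    k = k or max(1, round(np.sqrt(mdim)))       # default: sqrt(mdim), as advised above
    forest = []
    for _ in range(jbt):
        boot = rng.integers(0, n, size=n)       # bootstrap sample, same size as the data
        tree = DecisionTreeClassifier(max_features=k,
                                      random_state=int(rng.integers(1 << 31)))
        forest.append(tree.fit(x[boot], y[boot]))
    return forest

def predict(forest, x):
    # Plurality vote over the trees.
    votes = np.stack([t.predict(x) for t in forest])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)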


